                A RISC OS module for Blowfish encryption
                ========================================

    Blowfish is a block cipher designed, and made freely available, by
    Bruce Schneier. This document describes a small RISC OS
    relocatable module that performs Blowfish encryption and
    decryption in ECB, CBC and CFB modes. The module is about 8k in
    size; more than half of this size is taken up by fixed tables used
    by the key schedule. On my ARM610-based machine it encrypts and
    decrypts about 300 kilobytes per second; performance should be
    better on machines with caches larger than 4k (ARM710 and
    StrongARM being examples). A set of APCS-compliant functions
    to do the same things is distributed with the module.

The algorithm
-------------

    Blowfish is a Feistel cipher; that is, it belongs to the same
    family of ciphers as (for instance) DES and IDEA. A Feistel cipher
    encrypts a pair of words (or objects of other sizes) by performing
    a number of simple rounds of the form (x, y) := (y, x + Fn(y)),
    where + is something like exclusive-or that has easily
    computable inverses, and the functions Fn are usually rather
    simple but depend on the key in a complicated way.

    Notice that each round of a Feistel cipher is easily inverted if
    you happen to know what its F is; so the whole cipher is easily
    inverted if you know the key. Indeed, the procedure for decryption
    is almost identical to the procedure for encryption.

    In the case of Blowfish, x and y are 32-bit words; the +
    operation is indeed exclusive-or; and the Fn are obtained from two
    lookup tables constructed from the key. Specifically, if y is
    separated into its component bytes y0, y1, y2, y3 then Fn(y) =
    p[n] ^ (((s0[y0] + s1[y1]) ^ s2[y2]) + s3[y3]) where ^ is
    exclusive-or and + is ordinary addition; Ill explain the
    construction of the tables p[] and s[] in a moment.

    Thats not quite the whole story: the heart of the algorithm is 16
    rounds (the number is variable, in fact) of the sort Ive just
    described, but there are also two much simpler rounds, one before
    the interesting ones and one after, each of which has F equal to a
    constant function. Thus, round 0 is boring and has F0(y) = p[0];
    rounds 1 to 16 are interesting and are given by the expression
    above; and round 17 is boring and has F17(y) = p[17].
    
    Oh, one other annoying thing: Blowfish is designed for big-endian
    machines, so we reverse the byte order of each word before and after
    applying it.

The key schedule

    So, where do those tables come from? Obviously they must depend on
    the key. Their actual construction is rather peculiar: it goes as
    follows.

    1. Initialise the p[] and s[] tables. There are 18 entries for p
    (one for each round), and 1024 for s (four 256-entry blocks).
    Their initial values are the first 18+1024 digits of the
    base-2_32 expansion of pi3, with p coming first, then s0, , then
    s3.

    2. Now modify the p table according to the key by exclusive-or-ing
    the key with the p table. This can be done whatever the length of
    the key: go through bit by bit, and when you run off the end of
    the key just start again at the beginning. (I say bit by bit,
    but probably in all real applications the key will be an integer
    number of bytes.)

    3. Finally, mix all this information up as follows. Begin with the
    arbitrary value (0,0) for (x,y); apply Blowfish using the tables
    we have constructed so far; replace the first two entries in p
    with the new (x,y). Now do the same thing again (starting with
    these x,y), and replace the next two entries in p with the new
    (x,y). Keep on doing this until all entries in p and s have been
    replaced (using the same order as in step 1). This requires a
    total of 521 applications of Blowfish, so it's a lengthy process.

    4. The tables are now ready for real encryption.

ECB and CBC

    So, Blowfish (with a given key) transforms 64-bit values into new
    64-bit values. The obvious way to use it for encryption is to chop
    your message into 64-bit blocks, apply Blowfish to each block and
    concatenate the results. This is the so-called Electronic Code
    Book mode. It's quick and simple, but it has some disadvantages,
    stemming from the fact that the ciphertext corresponding to a
    given 64-bit piece of plaintext is always the same: if your
    plaintext is quite repetitive, this can be a problem.

    A simple way of getting round this problem is to use whats called
    Cipher Block Chaining mode, which is built from ECB mode in a
    way reminiscent of the way in which a Feistel cipher is
    constructed from its individual rounds. The output of each
    encryption step is combined with the next block of plaintext to
    form the input to the next encryption step. Heres how it works.

    Suppose your input is the sequence of blocks x0 x1 x2 ; then your
    output will be a sequence of blocks y0 y1 y2  where yn =
    f(xn^yn1). Here f is what the cipher does, so ECB mode is given
    by yn = f(xn). My more astute readers will be wondering how you
    get started, since there is no y1. The answer is that to use CBC
    mode you have to supply a value for y1 at the outset; this is
    usually called an IV, short for Initial Value. At the other end,
    you get the final IV (so to speak) which will be useful if your
    stream of blocks resumes at a later date.

    If you know the key (i.e., can decrypt messages in ECB mode) then
    its straightforward to decrypt messages in CBC mode; proving this
    is an easy exercise for the reader.

    ECB and CBC modes arent so useful if you want to encrypt a stream
    of bytes rather than of blocks. You could pad the last block out
    in some well-defined way and use ECB or CBC, but there might be
    situations in which you dont actually know whether the bytestream
    has ended or not. One solution to this is called Cipher Feedback
    Mode, or CFB for short. This resembles CBC mode, but for CFB mode
    the definition is yn = xn ^ f (yn1), with the same conventions
    about IVs as for CBC. The point of this is that the first k bytes
    of yn depend only on the first k bytes of xn, so you can compute
    it one byte at a time. Like CBC, CFB is decrypted in much the same
    way as it is encrypted.

The module
----------

    The Blowfish module provides a SWI interface and a fast
    direct-call interface to routines for key generation and
    encryption and decryption in ECB, CBC and CFB modes.

    In what follows, a key buffer means a word-aligned 4168-byte
    region of memory; an expanded key is a key buffer containing data
    obtained by the procedure described above under the heading The
    key schedule. A block means 64 bits of data, either word-aligned
    in memory or held in two registers, the first word being in the
    lower-numbered register; a reversed block means the same as a
    block, but with the halves the other way around. An IV means a
    block used as an initial or final value in CBC or CFB mode.
    rA,B denote X means that rA points to a region of memory rB bytes
    in size containing X; rA,B denote aligned X means the same, with
    the proviso that rA is word-aligned and rB is a multiple of 8.

Blowfish_GetRoutines

    fills in a user-supplied buffer with the addresses of the routines
    used by the other SWIs, so that they can be called directly for
    greater speed.

    On entry, r0 should point to the buffer, and r1 should be the
    number of addresses its expected to contain. On exit, r0 points
    just past the last entry placed in the buffer; r1 contains the
    number of spaces left unfilled in the buffer. If there werent
    enough addresses to fill the buffer, then the last entry placed
    there is 0, r0 has been advanced past it, but the count in r1
    considers that space to be unfilled.

    There is one routine for each of the SWIs that follow, in order;
    the entry and exit conditions of these routines are the same as
    those for the SWIs, except that the direct-call routines typically
    corrupt many registers. They also do not check the alignment of
    the data passed to them, whereas the SWIs do.

    Warning: if the module moves then all these direct-call addresses
    will become invalid. It will not move as a result of being
    RMTidyed, because it rudely refuses to die temporarily, but if
    e.g. a new version of the module is loaded then you should not
    under any circumstances continue to use the addresses provided by
    the old version.

    This SWI preserves all registers other than r0 and r1, and
    preserves the flags. It has no direct-call version.

Blowfish_ExpandKey

    transforms a user-supplied key into an expanded key by the
    procedure described above.

    On entry, r0,1 denote the key; r2 points to a key buffer. On exit,
    r0 and r1 are corrupted; all other registers, and the flags, are
    preserved; and the key buffer is filled in.

    The direct-call version of this SWI corrupts registers r0r10.

Blowfish_Encrypt2

    applies the Blowfish algorithm to a single 64-bit block.

    On entry, r2 points to an expanded key; r3,4 contain a block of
    plaintext. On exit, r0 and r1 are corrupted; r2 is preserved; r3,4
    contain a reversed block of ciphertext. All other registers, and
    the flags, are preserved.

    The direct-call version of this SWI also corrupts registers
    r5r10 and doesn't byte-reverse the output registers. (It does
    do byte-reversal on the input registers.)

Blowfish_Decrypt2

    applies the Blowfish algorithm in reverse to a single 64-bit
    block.

    On entry, r2 points to an expanded key; r3,4 contain a block of
    ciphertext. On exit, r0 and r1 are corrupted; r2 is preserved;
    r3,4 contain a reversed block of plaintext. All other registers,
    and the flags, are preserved.

    The direct-call version of this SWI also corrupts registers
    r5r10 and doesn't byte-reverse the output registers. (It does
    do byte-reversal on the input registers.)

Blowfish_EncryptBlock

    applies the Blowfish algorithm to a number of 64-bit blocks.

    On entry, r0,1 denote aligned plaintext; r2 points to an expanded
    key; r3,1 denote an aligned output buffer. On exit, the ciphertext
    is placed in the output buffer. r0 and r1 are corrupted; r2 is
    preserved; all other registers, and the flags, are also preserved.

    The direct-call version of this SWI also corrupts registers
    r3r10.

Blowfish_DecryptBlock

    applies the Blowfish algorithm in reverse to a number of 64-bit
    blocks.

    On entry, r0,1 denote aligned ciphertext; r2 points to an expanded
    key; r3,1 denote an aligned output buffer. On exit, the plaintext
    is placed in the output buffer. r0 and r1 are corrupted; r2 is
    preserved; all other registers, and the flags, are also preserved.

    The direct-call version of this SWI also corrupts registers
    r3r10.

Blowfish_EncryptBlockCBC

    applies the Blowfish algorithm in CBC mode to a number of 64-bit
    blocks.

    On entry, r0,1 denote aligned plaintext; r2 points to an expanded
    key; r3,4 contain an IV. On exit, the plaintext is replaced by
    ciphertext. r0 and r1 are corrupted; r2 is preserved; r3,4 contain
    an updated IV to use for subsequent calls to this SWI; all other
    registers, and the flags, are also preserved.

    The direct-call version of this SWI also corrupts registers
    r5r12.

Blowfish_DecryptBlockCBC

    applies the Blowfish algorithm in reverse CBC mode to a number of
    64-bit blocks.

    On entry, r0,1 denote aligned ciphertext; r2 points to an expanded
    key; r3,4 contain an IV. On exit, the ciphertext is replaced by
    plaintext. r0 and r1 are corrupted; r2 is preserved; r3,4 contain
    an updated IV to use for subsequent calls to this SWI; all other
    registers, and the flags, are also preserved.

    The direct-call version of this SWI also corrupts registers
    r5r12.

Blowfish_EncryptBlockCFB

    applies the Blowfish algorithm in CFB mode to a number of bytes.

    On entry, r0,1 denote plaintext; r2 points to an expanded key; r3
    points to an IV; and r4 should be equal (mod 8) to the number of
    bytes encrypted so far. On exit, r0 and r1 are corrupted; r2 and
    r3 are preserved; r4 is updated. All other registers, and the
    flags, are also preserved.

    The direct-call version of this SWI also corrupts registers
    r5r12.

Blowfish_DecryptBlockCFB

    applies the Blowfish algorithm in reverse CFB mode to a number of
    bytes.

    On entry, r0 denote plaintext; r2 points to an expanded key; r3
    points to an IV; and r4 should be equal (mod 8) to the number of
    bytes decrypted so far. On exit, r0 and r1 are corrupted; r2 and
    r3 are preserved; r4 is updated. All other registers, and the
    flags, are also preserved.

    The direct-call version of this SWI also corrupts registers
    r5r12.

Other things
------------

Performance

    A very unscientific test suggests that, on my machine (ARM610,
    30MHz), one one occasion:

	Expanding a key takes about 0.013 seconds;
	Encrypting or decrypting 512k of data in ECB mode takes
	about 1.65 seconds;
	Encrypting or decrypting 512k of data in CBC mode takes
	about 1.7 seconds;
	Encrypting or decrypting 512k of data in CFB mode takes
	about 2.1 seconds.

    Running Blowfish on a large amount of data does a lot of random
    access to the expanded key; as this is a little over 4k in size,
    the 4k cache size on the ARM610 hurts. I anticipate much better
    performance on machines with larger caches. Id be grateful for
    confirmation or refutation of this.

    It wouldnt be difficult to make CFB mode more efficient; it could
    be made about the same speed as CBC mode. This would enlarge the
    module slightly, and in any case Im lazy. If its important to
    you, tell me.

The author

    Im Gareth McCaughan; you can contact me by e-mail to
    gjm11@dpmms.cam.ac.uk or snail mail to Peterhouse, Cambridge.

Legal stuff

    I own the copyright on the Blowfish module. If you have a copy,
    then: You may use it yourself for any purpose. You may distribute
    it unaltered providing you do not make money from it. You may
    distribute altered versions provided they (1) still contain my
    copyright messages, and (2) are clearly indicated as being altered
    versions. You may do anything at all if you have my permission.
    You may do nothing else with it.

    The algorithm is Bruce Schneiers. Im not sure what moral and
    legal rights he has, or how many of them he has waived. I do know
    that free availability was one of his primary concerns in
    producing Blowfish. In any case, I request that if you distribute
    this module in any form you credit Schneier with the design of the
    algorithm.

Andy Armstrongs Blowfish module

    Andy Armstrong (of Wonderworks) has also written a module to do
    Blowfish encryption. (I had forgotten this when I wrote the first
    version of mine!) You should use my module instead; it provides
    a compatible SWI interface (not documented here; you should use
    the calls above in preference) and has some important advantages.
    Because this module extends the facilities of Andys, I have
    (with his permission) given it the same name and SWI chunk number.

    About 420 bytes of the module are compatibility code.
